This project deals with weather forecasting data recorded from 9am to 9am the following day across 49 different locations in Australia. We will use KMeans, Hierarchical Agglomerative Clustering, and Decision Trees to predict whether or not it will rain the following day.
Even with my tenuous grasp of meteorology, I believe that attributes such as humidity, pressure, and location will be key players in predicting our target variable, RainTomorrow. At any rate, let's dive into the data.
import pandas as pd #importing some necessary packages for data preprocessing and visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn import preprocessing
df = pd.read_csv('Weather Forecast Training.csv')
df.describe()
#Consider log transforming attributes with large ranges/scales (Rainfall, Evaporation, WindGustSpeed, Humidity)
#df.info() #4 categorical: Location, WindGustDir, WindDir and RainToday + Target Variable (Boolean)
import missingno as msno
missingdata_df = df.columns[df.isnull().any()].tolist()
msno.matrix(df[missingdata_df]);
msno.bar(df[missingdata_df], color="blue", log=False, figsize=(30,18))
plt.title("Percentage missing data")
msno.heatmap(df[missingdata_df], figsize=(20,20))
df.corr(numeric_only=True)
Instantly, I notice the difference in scale across most of our attributes. Having 16 dimensions measured in different units (mm of Rainfall, km/hr of Wind, hPa of Pressure, etc.) can have devastating effects on how our distance-based functions operate.
I also notice the presence of numerical (continuous/discrete) and categorical data. The target variable is binary, which is typical of classification problems (as opposed to clustering), but we first need to figure out the nature of our missing data.
The absence of data could be Missing Completely at Random (MCAR), i.e. completely independent of the observable parameters of interest (highly unlikely, since forecasting attributes almost always affect one another); Missing at Random (MAR), independent of RainTomorrow and therefore unbiased (also unlikely); or Missing Not at Random (MNAR), the most likely case here.
Missing not at random is common in real-world data and could be due to data collection inconsistencies across regions, or to one variable being inferable from another (e.g. if mm of Rainfall was recorded, we obviously know whether it rained).
Additionally, all attributes have some correlation to rainfall, so missing values can sometimes be inferred from other values (there are no irrelevant attributes such as the color of a house; all the data are meteorological). High humidity and low pressure, for example, are generally required for rainfall, so a missing value may exist because a location simply infers one attribute from another.
Evaporation, Sunshine and Cloud also seem like difficult things to measure across an entire continent, perhaps due to cost of implementation or neglect.
Perhaps only in extreme cases of weather (>50 mm of Rainfall) are commonly missing attributes such as Evaporation recorded.
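One quick, hedged check for MNAR behaviour is to compare an attribute's distribution between rows where another attribute is missing and where it is present. A minimal sketch with made-up numbers (not the actual weather data):

```python
import pandas as pd
import numpy as np

# Tiny synthetic frame standing in for the weather data
df = pd.DataFrame({
    'Rainfall':    [0.0, 12.5, 3.2, 55.0, 0.0, 80.1],
    'Evaporation': [4.1, np.nan, 5.0, np.nan, 6.2, np.nan],
})

# If Evaporation tends to be missing on heavy-rain days, its missingness
# depends on an observed value -> evidence against MCAR
miss = df['Evaporation'].isnull()
print(df.groupby(miss)['Rainfall'].mean())
```

A large gap between the two group means (as in this toy frame) suggests the missingness mechanism is tied to observed weather, which supports the MNAR reading above.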
Evaporation, Sunshine, Pressure and Temp have a negative correlation to Rainfall, which is intuitive given the conditions you'd imagine rain to occur in.
MinTemp, WindGustSpeed, WindSpeed, Humidity and Cloud all have positive relationships to Rainfall.
df.groupby('Location').count().head(5)
imp = ['MinTemp','MaxTemp','Rainfall',
'Evaporation','Sunshine','WindGustSpeed',
'WindSpeed','Humidity','Pressure',
'Cloud','Temp']
#Optional: winsorize the numeric columns per location to tame outliers before imputing
#df[imp] = df.groupby('Location')[imp].transform(lambda x: x.clip(x.quantile(0.05), x.quantile(0.95)))
for col in imp:
    if df[col].median() > df[col].mean():
        df[col] = df.groupby('Location')[col].transform(lambda x: x.fillna(x.mean()))
    else:
        df[col] = df.groupby('Location')[col].transform(lambda x: x.fillna(x.median()))
df.isnull().sum()
#this didn't work for locations that never take Evaporation, Sunshine,
#WindGustSpeed, Pressure or Cloud measurements
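The group-wise transform leaves NaN when an entire location never records a column, since the group's own median is NaN. A dataset-wide fallback, sketched here on a tiny made-up frame, fills those leftovers:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Location': ['A', 'A', 'B', 'B'],
    'Sunshine': [7.0, np.nan, np.nan, np.nan],  # B never measures Sunshine
})

# Per-location fill handles A, but B stays NaN...
df['Sunshine'] = df.groupby('Location')['Sunshine'].transform(lambda x: x.fillna(x.median()))
# ...so fall back to the global median for locations with no data at all
df['Sunshine'] = df['Sunshine'].fillna(df['Sunshine'].median())
print(df)
```

In this project the remaining gaps are instead handled by MICE later on; the global-median fallback is just the simplest alternative.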
df_wind = pd.concat([df,pd.get_dummies(df['WindGustDir'],prefix='WindGustDir')],axis=1)
df_wind = pd.concat([df_wind,pd.get_dummies(df['WindDir'],prefix='WindDir')],axis=1)
df_wind = df_wind[['WindGustDir_E','WindGustDir_ENE', 'WindGustDir_ESE', 'WindGustDir_N', 'WindGustDir_NE',
'WindGustDir_NNE', 'WindGustDir_NNW', 'WindGustDir_NW', 'WindGustDir_S',
'WindGustDir_SE', 'WindGustDir_SSE', 'WindGustDir_SSW',
'WindGustDir_SW', 'WindGustDir_W', 'WindGustDir_WNW', 'WindGustDir_WSW',
'WindDir_E', 'WindDir_ENE', 'WindDir_ESE', 'WindDir_N', 'WindDir_NE',
'WindDir_NNE', 'WindDir_NNW', 'WindDir_NW', 'WindDir_S', 'WindDir_SE',
'WindDir_SSE', 'WindDir_SSW', 'WindDir_SW', 'WindDir_W', 'WindDir_WNW',
'WindDir_WSW']]
import seaborn as sns
cor_obj = df_wind.corr()
cor_obj
sns.heatmap(cor_obj)
#fill WindGustDir with its corresponding WindDir (or vice versa) because the two are highly correlated and differ by each observation (day-to-day)
df.WindGustDir.fillna(df.WindDir,inplace=True)
df.WindDir.fillna(df.WindGustDir,inplace=True)
df.isnull().sum()
#some locations don't take either metric so I will impute the remaining null values with mode of entire dataset
#most raintoday values have rainfall = 0. Additionally these missing values comprise only ~1.4 %
#of the overall dataset so I will fill the 747 missing RainToday with "No"
df.RainToday.fillna(df.RainToday.mode()[0],inplace=True)
The min/max temperature spread (and its delta) shows that desert locations have the largest daily range, whilst coastal locations have less variation.
temp_avg = df.groupby('Location')[['MinTemp', 'MaxTemp']].mean()
temp_avg['change'] = temp_avg['MaxTemp'] - temp_avg['MinTemp']
temp_avg.sort_values(by='change', ascending=True).plot(kind='barh', figsize=(15,10))
plt.show()
#for c in df.columns:
# fig = px.histogram(df,x=c, color = "RainToday")
# fig.show()
#Distribution Notes:
#Max/Min temp show fairly regular distributions and so does Temp. I will visualize this by city below to determine
#how we should impute these missing values. Based on that we may drop either Min/MaxTemp or Temp (~0% missing for all)
#Right Skewed: Evap 0-75, Sunshine right skewed but also uniform 2-14, WindGustSpeed slightly right-skewed, 2 km/hr separation between each count. May be standard practice in some locations and not others
#Right Skewed:WindSpeed: Similar pattern to wind gust speed, cloud (0-8 okta: discrete),
#Left Skewed: Humidity 0-100 very low kurtosis, Pressure (Super high range Log Transform)
#Temp: Approximately Normal: 0-45
#WindDir/WindGustDir random (probably by location)
#plotting rain distribution
rain_dist = px.histogram(df, 'Rainfall')
rain_dist.show()
#plotting evaporation distribution
evap_dist = px.histogram(df, 'Evaporation')
evap_dist.show()
#plotting pressure distribution
pressure_dist = px.histogram(df, 'Pressure')
pressure_dist.show()
df['Log_Rainfall'] = df['Rainfall'].map(lambda x: np.log(x+2))
log_rain = px.histogram(df,'Log_Rainfall')
log_rain.show()
df['Log_Evap'] = df['Evaporation'].map(lambda x: np.log(x+2))
#log_evap = px.histogram(df,'Log_Evap')
#log_evap.show()
df['Log_Pressure'] = df['Pressure'].map(lambda x: np.log(x+2))
log_pressure = px.histogram(df,'Log_Pressure')
log_pressure.show()
#dropping the original variables now that we have their log-transformed versions
df = df.drop(['Rainfall','Pressure','Evaporation'], axis = 1)
#dropping categorical variables: they didn't help any of our models' precision/recall scores
df = df.drop(['WindDir','WindGustDir','Location'], axis = 1)
df.isnull().sum()
I've used multiple imputation to account for the uncertainty associated with each individual imputation, which a single-imputation method such as K-Nearest Neighbors doesn't capture; MICE will help us find the most likely value overall.
MICE essentially uses chained regressions to model the distribution of each variable in order to accurately impute what is missing. I've found that this method is very flexible and much less time consuming than K-Nearest Neighbors.
Although Evaporation, Pressure, Cloud and Sunshine constitute a large portion of the missing values in our data set, they do show a strong correlation to rainfall. I initially tried dropping these attributes from my model, however recall notably increased after I decided to impute them instead.
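An equivalent imputer ships in scikit-learn behind an experimental flag; a minimal sketch on toy data, where the missing entry roughly follows y = 2x:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],   # missing value; other rows follow y = 2x
              [4.0, 8.0]])

imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled[2, 1])  # close to 6, recovered from the linear relationship
```

Because the imputer regresses each column on the others, it recovers a value near 6 here rather than the column mean (~4.7) that a naive fill would produce.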
from fancyimpute import IterativeImputer
MICE_imputer = IterativeImputer()
df_MICE = df.copy().select_dtypes(include=['float'])  #np.float was removed in recent NumPy
df_MICE.iloc[:,:] = MICE_imputer.fit_transform(df_MICE)
df_MICE.info()
#rounding variables like Cloud and Sunshine to the nearest integer and ensuring that upper/lower limits are not exceeded
#e.g. an observation can't have < 0 hours of sunshine or over 100% humidity
df_MICE['Cloud'] = df_MICE['Cloud'].round()
df_MICE['Sunshine'] = df_MICE['Sunshine'].round()
df_MICE["Sunshine"] = df_MICE["Sunshine"].clip(lower=0)
df_MICE["Cloud"] = df_MICE["Cloud"].clip(upper=8)
df_MICE["Humidity"] = df_MICE["Humidity"].clip(upper=100)
df_MICE['RainToday'] = df['RainToday'].values
df_MICE['RainTomorrow'] = df['RainTomorrow'].values
df = df_MICE[['MinTemp','MaxTemp','Log_Rainfall','Log_Evap','Sunshine','WindSpeed',
'WindGustSpeed','Humidity','Log_Pressure','Cloud','Temp','RainToday',
'RainTomorrow']]
#putting our Mice imputed data back into our core training dataset
df.describe()
#Mapping RainToday and RainTomorrow to boolean values for our clustering models to process
df["RainToday"] = df["RainToday"].map({'Yes': 1, 'No': 0})
df["RainTomorrow"] = df["RainTomorrow"].map({'Yes': 1, 'No': 0})
df.head(5)
The objective of k-means is to minimize the intra-cluster variance (the sum of squared distances from each data point to its cluster center) and thereby produce the most coherent clusters.
Following clustering, we would like to measure the precision, recall and accuracy of our model. Precision is the fraction of true positives among all observations classified as positive (TP / (TP + FP)). Recall is the fraction of true positives among all actually positive cases (TP / (TP + FN)).
We would like to maximize recall in these models because it is better to predict rain when it stays dry than to predict no rain when it actually rains. If people were expecting to play golf and our model predicted blue skies for the following day, it would be problematic if it rained. In our model the cost of false positives isn't as high as that of false negatives. This idea will be elaborated later.
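To make the trade-off concrete, precision and recall from a toy confusion matrix (the counts are made up, not from this dataset):

```python
# Toy counts: 40 true positives, 10 false positives, 20 false negatives
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)   # of the rain predictions, how many were right
recall    = tp / (tp + fn)   # of the actual rainy days, how many we caught
print(precision, recall)     # 0.8 0.666...
```

Note that the 20 false negatives (missed rainy days) hurt recall, which is exactly the error mode we said is most costly here.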
#X_test, y_test, X_train, y_train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['MinTemp','MaxTemp','Log_Rainfall','Log_Evap','Sunshine',
'WindGustSpeed','Humidity','Log_Pressure','WindSpeed',
'Cloud','Temp','RainToday']],df["RainTomorrow"],test_size=0.3, random_state=42)
#Running Kmeans with default parameters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, n_init=25, max_iter=100, random_state=42)
kmeans.fit(X_train)  #KMeans is unsupervised; labels are not used in fitting
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
# k means determine k
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)
    distortions.append(kmeanModel.inertia_)
# Plot the elbow
plt.figure(figsize=(20,10))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.annotate('Probable Elbow',
xy=(2,distortions[1]),
xytext=(2,distortions[4]),
textcoords='data',
fontsize=15,
arrowprops=dict(facecolor='black',shrink=0.1))
plt.title('The Elbow Method showing the optimal k')
plt.show()
from sklearn.model_selection import GridSearchCV
#precompute_distances and the 'auto'/'full' algorithm options were removed in recent scikit-learn ('full' is now 'lloyd')
kmeans_grid = GridSearchCV(kmeans, param_grid = {'init': ['k-means++', 'random'],
'algorithm': ['lloyd','elkan']}, scoring='recall')
kmeans_grid.fit(X_train, y_train)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, n_init=25, max_iter=100, random_state=42, algorithm='lloyd', init='k-means++')
kmeans.fit(X_train)  #unsupervised: labels are not used in fitting
y_pred = kmeans.predict(X_test)
from sklearn import metrics
print(metrics.classification_report(y_test, (y_pred)))
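One caveat worth hedging: KMeans cluster ids (0/1) are arbitrary, so a report like the one above is only meaningful if cluster 1 happens to line up with RainTomorrow = 1. A small sketch of checking and flipping the labels (synthetic data, not the weather set):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0])
labels = np.array([1, 1, 0, 0, 0, 1])   # clusters found, but ids inverted

# Keep whichever orientation agrees with the target more often
if accuracy_score(y_true, labels) < 0.5:
    labels = 1 - labels
print(accuracy_score(y_true, labels))  # 1.0 after flipping
```

With random_state fixed as above the orientation is stable, but this check guards against accidentally reporting 1 - accuracy.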
import plotly.graph_objects as go
fig = go.Figure(layout = {"title": "Best Model Report for KMeans Clustering Algorithm using algorithm=full,init=k-means++"},
data=[go.Table(header=dict(values=['Accuracy','Negatives/Positives','precision', 'recall', 'f1-score','support']),
cells=dict(values=[[0.72],[0,1], [0.76, 0.68], [0.66, 0.78],[0.71,0.73],[8030,7564]]))
])
fig.show()
results = metrics.confusion_matrix(y_test,(y_pred))
print(results)
fig = go.Figure(layout = {"title": "Confusion Matrix for KMeans Clustering Algorithm"},
data=[go.Table(header=dict(values=['','Predicted Positives','Predicted Negatives']),
cells=dict(values=[['Actual Positives','Actual Negatives'],[5303,1687],[2729,5876]]))
])
fig.show()
In general, the merges and splits of this model are decided greedily and can be represented in the form of a dendrogram. KMeans is tied to the 'l2'/Euclidean distance by default, whereas HAC's affinity can be manually changed to Manhattan or Minkowski.
For this case, I've decided to use a Euclidean distance function with Ward linkage (which minimizes the variance of the clusters being merged). This is necessary in order to give us clusters that are more coherent in form. If attributes like Pressure or imputed values of Sunshine/Cloud are producing outliers, then we want to minimize the variance between each data point, since we are only dealing with two clusters.
The standard algorithm has a time complexity of at least O(N²), which is why the model needed to be trained in Google Colab. Again, this data was not scaled, given that precision and recall scores reverted to that of a coin flip when it was, which suggests that our data set was properly scaled and cleaned to begin with.
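A back-of-the-envelope check on why Colab was needed: a full pairwise-distance matrix over n samples holds n² floats (the row count below is hypothetical, purely for illustration):

```python
n = 36_000                # hypothetical number of training rows
gb = n * n * 8 / 1e9      # float64 pairwise-distance matrix, in GB
print(f"{gb:.1f} GB")     # ~10 GB before any clustering work begins
```

Even halving that for a condensed (upper-triangular) matrix, the memory footprint dwarfs the dataset itself, which is the practical cost of the O(N²) behaviour mentioned above.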
#importing a few necessary packages
import scipy
from scipy.cluster.hierarchy import dendrogram,linkage
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm
from sklearn.preprocessing import scale
#Configure the output
np.set_printoptions(precision=4,suppress=True)
%matplotlib inline
rcParams["figure.figsize"] =20,10
sb.set_style("whitegrid")
#isolating our target variable in a data frame and dropping it for HAC
data = df.copy()  #copy so the drop below doesn't mutate df
target = pd.DataFrame(data['RainTomorrow'])
data.drop(['RainTomorrow'], axis=1, inplace=True)
k = 2
#build the model
HClustering = AgglomerativeClustering(n_clusters= k, affinity="euclidean",linkage="ward")
#fit the model on the dataset
HClustering.fit(data)
#accuracy of the model
print("Accuracy Score: ", sm.accuracy_score(target,(HClustering.labels_)))
print("Precision Score: ",sm.precision_score(target,(HClustering.labels_)))
print("Recall Score: ",sm.recall_score(target,(HClustering.labels_)))
fig = go.Figure(layout = {"title": "Best HAC Model using 'L2'-based distance with ward linkage"},
data=[go.Table(header=dict(values=['Accuracy','Precision','Recall']),
cells=dict(values=[['71%'],['70%'],['71%']]))
])
fig.show()
#generate dendrogram
z = linkage(data,"ward")
#Plotting the dendrogram
dendrogram(z,truncate_mode= "lastp", p = 16, leaf_rotation=45,leaf_font_size=15, show_contracted=True)
plt.title("Truncated Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
#divide the cluster
plt.axhline(y=450)
plt.axhline(350)
plt.axhline(200)
plt.show()
This dendrogram shows two natural clusters, which is consistent with our binary target.
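The fcluster import above can also cut a linkage matrix into flat clusters directly; a sketch on toy points (in this project the real z from the weather data would replace the toy linkage):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two obvious blobs in 1-D
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
z = linkage(pts, 'ward')

labels = fcluster(z, t=2, criterion='maxclust')  # ask for 2 flat clusters
print(labels)  # the first three points share one label, the last three the other
```

This is equivalent to drawing a horizontal cut through the dendrogram at a height that leaves exactly two branches.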
from sklearn.tree import DecisionTreeClassifier
d_tree = DecisionTreeClassifier()
d_tree
d_tree.fit(X_train, y_train)
y_pred = d_tree.predict(X_test)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
param_grid = {'criterion': ['gini', 'entropy'],
'min_samples_split': [2, 10, 20],
'max_depth': [5, 10, 20, 25, 30],
'min_samples_leaf': [1, 5, 10],
'max_leaf_nodes': [2, 5, 10, 20]}
grid = GridSearchCV(d_tree, param_grid, cv=6, scoring='recall')
grid.fit(X_train, y_train)
for hps, values in grid.best_params_.items():
    print(f"{hps}: {values}")
print(f"Precision: {round(metrics.precision_score(y_test, y_pred)*100)}%")
print(f"Recall: {round(metrics.recall_score(y_test, y_pred)*100)}%")
print(f"Accuracy: {round(metrics.accuracy_score(y_test, y_pred)*100)}%")
d_tree = DecisionTreeClassifier(criterion = 'gini',max_depth = 5,
max_leaf_nodes = 20,
min_samples_leaf = 1,
min_samples_split = 2)
d_tree
d_tree.fit(X_train, y_train)
y_pred = d_tree.predict(X_test)
d_tree.tree_.max_depth
from sklearn import metrics
print(f"Precision: {round(metrics.precision_score(y_test, y_pred)*100)}%")
print(f"Recall: {round(metrics.recall_score(y_test, y_pred)*100)}%")
print(f"Accuracy: {round(metrics.accuracy_score(y_test, y_pred)*100)}%")
#The baseline model has improved from 71% to 76% accuracy following the updated grid search.
from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(d_tree.fit(X_train,y_train), out_file='tree.dot',
feature_names = X_train.columns,
class_names = ['Rain_No','Rain_Yes'],
rounded = True, proportion = False,
precision = 2, filled = True)
# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
Optimization: The Gini index (which measures split quality) helped bring our model's accuracy and recall up. max_depth caps the number of levels a decision tree can be split into, which limits overfitting. max_leaf_nodes also helped us build a tree with the n best nodes for reducing impurity.
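For reference, the Gini impurity of a node with two classes is 1 - sum(p_i²); a worked sketch with made-up class counts:

```python
# Gini impurity: 0 = pure node, 0.5 = maximally mixed for two classes
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 50]))   # 0.5  (maximally mixed)
print(gini([90, 10]))   # ~0.18 (mostly one class)
print(gini([100, 0]))   # 0.0  (pure)
```

The tree greedily picks the split whose child nodes have the lowest weighted Gini impurity, which is why the criterion directly shapes which humidity/pressure thresholds end up near the root.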
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(y_test))]
# predict probabilities
lr_probs = d_tree.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Decision Tree: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
# plot the roc curve for the model
plt.figure(figsize=(16,8))
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Decision Tree')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
This model has 83% area under its curve, indicating the decision tree is much more accurate than simply flipping a coin on tomorrow's weather.
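AUC is literally the area under that curve; a sketch confirming roc_auc_score against a manual trapezoidal integration on tiny synthetic scores (not the weather model's outputs):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# trapezoidal area under the (fpr, tpr) curve
area = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auc, area)  # the two values agree
```

A no-skill model traces the diagonal (area 0.5), so the 0.83 reported above reflects genuine ranking power rather than chance.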
fig = go.Figure(layout = {"title": "Best Decision Tree Model using gini, min_samples_leaf = 1, max_leaf_nodes = 20 and max_depth = 5"},
data=[go.Table(header=dict(values=['Accuracy','Precision','Recall']),
cells=dict(values=[['76%'],['72%'],['77%']]))
])
fig.show()
import pandas as pd
import numpy as np
df_test = pd.read_csv('Weather Forecast Testing.csv')
imp = ['MinTemp','MaxTemp','Rainfall',
'Evaporation','Sunshine','WindGustSpeed',
'WindSpeed','Humidity','Pressure',
'Cloud','Temp']
for col in imp:
    if df_test[col].median() > df_test[col].mean():
        df_test[col] = df_test.groupby('Location')[col].transform(lambda x: x.fillna(x.mean()))
    else:
        df_test[col] = df_test.groupby('Location')[col].transform(lambda x: x.fillna(x.median()))
df_test.WindGustSpeed.fillna(df_test.WindGustSpeed.mode()[0],inplace=True)
df_test.RainToday.fillna(df_test.RainToday.mode()[0],inplace=True)
df_test['Log_Rainfall'] = df_test['Rainfall'].map(lambda x: np.log(x+2))
df_test['Log_Evap'] = df_test['Evaporation'].map(lambda x: np.log(x+2))
df_test['Log_Pressure'] = df_test['Pressure'].map(lambda x: np.log(x+2))
df_test = df_test.drop(['Rainfall','Pressure','Evaporation','WindDir','WindGustDir','Location'], axis = 1)
df_test.head()
from fancyimpute import IterativeImputer
MICE_imputer = IterativeImputer()
df_testMICE = df_test.copy().select_dtypes(include=['float'])  #np.float was removed in recent NumPy
df_testMICE.iloc[:,:] = MICE_imputer.fit_transform(df_testMICE)
df_testMICE['Cloud'] = df_testMICE['Cloud'].round()
df_testMICE['Sunshine'] = df_testMICE['Sunshine'].round()
df_testMICE["Sunshine"] = df_testMICE["Sunshine"].clip(lower=0)
df_testMICE["Cloud"] = df_testMICE["Cloud"].clip(upper=8)
df_testMICE["Humidity"] = df_testMICE["Humidity"].clip(upper=100)
df_testMICE['RainToday'] = df_test['RainToday'].values
df_test = df_testMICE[['MinTemp','MaxTemp','Log_Rainfall','Log_Evap','Sunshine','WindSpeed',
'WindGustSpeed','Humidity','Log_Pressure','Cloud','Temp','RainToday']]
df_test["RainToday"] = df_test["RainToday"].map({'Yes': 1, 'No': 0})
df_test.head()
df_test.to_csv('df_test_final.csv')
df_output = pd.read_csv('Weather Forecast Testing.csv')
k = 2
HClustering = AgglomerativeClustering(n_clusters=k , affinity="euclidean",linkage="ward")
#fit the model on the dataset
HAC_test = HClustering.fit(df_test)
y_dt_pred = d_tree.predict(df_test)
y_kmeans_pred_test = (kmeans.predict(df_test))
y_HAC = HClustering.labels_
Final = pd.DataFrame(list(zip(df_output['ID'], y_kmeans_pred_test,y_HAC, y_dt_pred,)),columns=['ID','KMeans','HAC','DT'])
Final.head(50)
Final.to_csv('Predictions_Ondocin.csv', index=False)